
  • Looping to import delimited a large .csv by parts with rowrange() 'forgets' variables

    Hello.
    I want to work with a .csv file that is more than 20 GB, so I cannot read it in the normal way. I was trying to import it in parts using rowrange(), filter what I need, save each part, and then append them.

    The issue is that at some point the import doesn't keep the variable names and instead names the variables after the first observation of that chunk, or something like that.
    I have tried different chunk sizes and different starting points for the loop, and this keeps coming up at different points in the data. Any thoughts on why this happens?
    (More than welcome to hear other possible approaches to this as well.)

    Here is my code:

    Code:
    local k = 1
    local j = 500000
    local h = 1

    while `k' < 15000000 {
        import delimited using "file.csv", rowrange(`k':`j') clear

        replace estado = "MEXICO" if estado == "MÉXICO"
        replace estado = "CDMX" if estado == "DISTRITO FEDERAL"
        keep if inlist(estado, "CDMX", "MEXICO", "HIDALGO")    // This drops around 70% of the data, on average

        save "file_chunk_`h'.dta", replace

        local k = `k' + 500000
        local j = `j' + 500000
        local h = `h' + 1
    }
    The size of each chunk (500,000) and the limit of the while loop (15 million) are arbitrary. I don't know how many observations I have (I believe around 65 million). As I said, I have tried different chunk sizes and different limits for the while loop. With these numbers, on the 15th iteration, when it pulls rows 7,000,001 to 7,500,000, it does so "successfully" but forgets the variable names, so when I try to replace estado it naturally issues "variable estado not found r(111);". I tried independently importing rows 6,900,000 to 7,100,000, then 7,100,000 to 7,200,000, and so on, and it works perfectly.
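
    One way to at least flag the offending chunk instead of stopping with r(111) would be a guard like this (a sketch using the same file and variable names as above; it diagnoses the bad chunk rather than fixing the underlying problem):

    Code:
    import delimited using "file.csv", rowrange(`k':`j') clear

    * check that the variable names survived before touching estado
    capture confirm variable estado
    if _rc {
        display as error "chunk `h' (rows `k'-`j'): estado not found, names were lost"
        describe, short
    }
    else {
        replace estado = "MEXICO" if estado == "MÉXICO"
        replace estado = "CDMX" if estado == "DISTRITO FEDERAL"
        keep if inlist(estado, "CDMX", "MEXICO", "HIDALGO")
        save "file_chunk_`h'.dta", replace
    }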

    Any suggestions?

    Thanks in advance.



    Carlos Aburto Castellanos
    Last edited by Carlos Aburto-Castellanos; 08 Sep 2020, 14:18.

  • #2
    What you describe does sound like a bug in -import delimited-, but I would nevertheless try your code on a much smaller file with appropriately adjusted k and j.

    As for a workaround: there is a community-contributed program, -chunky-, available from SSC, that will break a large text file into a bunch of smaller chunks. I wonder if you could use it, import each chunk, and then put them together with -append-. You would need to find a convenient way to insert the variable-name line at the beginning of each chunk, about which I don't have any quick ideas.
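
    The reassembly half of that idea might look like the following sketch (here `n' and the part`i'.csv names are placeholders for whatever the splitting step produces, with each piece assumed to carry its own variable-name line):

    Code:
    * sketch: n chunk files part1.csv ... partN.csv, each with a header row
    tempfile building
    forvalues i = 1/`n' {
        import delimited using "part`i'.csv", clear
        keep if inlist(estado, "CDMX", "MEXICO", "HIDALGO")
        if `i' == 1 {
            save "`building'"
        }
        else {
            append using "`building'"
            save "`building'", replace
        }
    }
    use "`building'", clear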



    • #3
      Hi, thanks very much for the reply, Mike.

      -chunky- works great. It is very simple and even before actually chunking the data,
      Code:
       chunky using ..., analyze
      is very useful for knowing what you are working with.

      Regarding the variable names, the option -header(include)- takes perfectly good care of that.
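
      For completeness, the split step with that option might look like this (chunk size and stub are arbitrary, and the exact filenames chunky writes should be checked afterwards):

      Code:
      * split once, copying the variable-name line into every piece
      chunky using "file.csv", chunksize(1 GB) header(include) stub("file_part")

      * spot-check one piece: names should now import correctly
      import delimited using "file_part0001.csv", clear    // adjust to chunky's actual numbering
      describe, short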

      Thanks a lot,

      C



      • #4
        I'm very new to Stata (self-taught) and trying to do a very similar thing. I'm able to use the chunky command; however, I need to filter out parts of the data depending on a single variable and then save those observations as chunks (my data set has 50 million+ values in it).

        I'm using this code for the chunky command:

        Code:
        chunky using "/Users/katelinphelps/Documents/ORIGINAL DATA SET.txt", chunksize(1 GB) header(include) stub("allvalues")

        And this code to filter the data I need:

        Code:
        keep if deedsitusstatestatic == "NY"

        But how do I combine those two commands into a single piece of code?
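
        A sketch of one way to combine them, looping over whatever chunk files -chunky- wrote (the allvalues* pattern and the output filename are assumptions; check the names it actually produced):

        Code:
        * run from the folder where the chunks were written
        local chunks : dir "." files "allvalues*"
        tempfile building
        local first = 1
        foreach f of local chunks {
            import delimited using "`f'", clear
            keep if deedsitusstatestatic == "NY"
            if `first' {
                save "`building'"
                local first = 0
            }
            else {
                append using "`building'"
                save "`building'", replace
            }
        }
        use "`building'", clear
        save "ny_values.dta", replace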
